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Abstract 

A new result in convex analysis on the calculation of proximity operators in certain scaled 
norms is derived. We describe efficient implementations of the proximity calculation for a useful 
class of functions; the implementations exploit the piece-wise linear nature of the dual problem. 
The second part of the paper applies the previous result to acceleration of convex minimization 
problems, and leads to an elegant quasi-Newton method. The optimization method compares 
favorably against state-of-the-art alternatives. The algorithm has extensive applications including 
signal processing, sparse recovery and machine learning and classification. 

1 Introduction 

Convex optimization has proved to be extremely useful to all quantitative disciplines of science. A 
common trend in modern science is the increase in size of datasets, which drives the need for more 
efficient optimization schemes. For large-scale unconstrained smooth convex problems, two classes of 
methods have seen the most success: limited memory quasi-Newton methods and non-linear conjugate 
gradient (CG) methods. Both of these methods generally outperform simpler methods, such as gradient 
descent. 

For problems with non-smooth terms and/or constraints, it is possible to generalize gradient descent 
with proximal gradient descent (which includes projected gradient descent as a sub-cases), which is 
just the application of the forward-backward algorithm pQ. 

Unlike gradient descent, it is not easy to adapt quasi-Newton and CG methods to problems involving 
constraints and non-smooth terms. Much work has been written on the topic, and approaches generally 
follow an active-set methodology. In the limit, as the active-set is correctly identified, the methods 
behave similar to their unconstrained counterparts. These methods have seen success, but are not as 
efficient or as elegant as the unconstrained versions. In particular, a sub-problem on the active-set 
must be solved, and the accuracy of this sub-iteration must be tuned with heuristics in order to obtain 
competitive results. 

1.1 Problem statement 

Let H = (R N , (•, •)) equipped with the usual Euclidean scalar product (x, y) — X};=i x iV% an d associated 
norm ||x|| = \J (x, x). For a matrix V € M> NxN in the symmetric positive-definite (SDP) cone S++(iV), 
we define "Hy = (R N , (•, -) v ) with the scalar product (x,y) v — (x, Vy) and norm \\x\\ v corresponding 
to the metric induced by V. The dual space of Hy, under (•, •), is Hy-i. We denote I« the identity 
operator on %. 
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A real- valued function / : % — > RU{+oo} is (0)-coercive if lini|| I ||_i. +00 / (x) = +00. The domain of 
/ is defined by dom / = {x £ H : f(x) < +00} and / is proper if dom / ^ 0. We say that a real-valued 
function / is lower semi- continuous (lsc) if lim inf X ^ XQ f(x) > /(xq). The class of all proper lsc convex 
functions from H to E U {+00} is denoted by r (^). The conjugate or Legendre-Fenchel transform of 
/ on % is denoted /* . 

Our goal is the generic minimization of functions of the form 

min {F(x) 4 f( x ) + h (x)} , (P) 

lEH 

where /, h 6 To(H). We also assume the set of minimizers is nonempty (e.g. F is coercive) and that a 
standard domain qualification holds. We take / € C 1 (M Ar ) with L-Lipschitz continuous gradient, and 
we assume h is separable. Write x* to denote an element of ArgminF(:c). 

The class we consider covers non-smooth convex optimization problems, including those with convex 
constraints. Here are some examples in regression, machine learning and classification. 

Example 1 (LASSO). 

mmhAx-bWl + XWxh . (1) 
xen z 

Example 2 (Non-negative least-squares (NNLS)). 

min — ||Ae — 6| 1 1 subject to x^O. (2) 

Example 3 (Sparse Support Vector Machines). One would like to find a linear decision function which 
minimizes the objective 

^ m 

min— }L((x,Zi)+b,yi) + X\\x\\i (3) 
xen m ^-^ 

i—l 

where for i = 1, • • • ,m, (zi, yi) 6 R x {±1} is the training set, and L is a smooth loss function with 
Lip schitz- continuous gradient such as the squared hinge loss L(j)i,yi) — max(0, 1 — yiyi) 2 or the logistic 
lossL{y hyi )=\og{l + e-y^). 

1.2 Contributions 

This paper introduces a class of scaled norms for which we can compute a proximity operator; these 
results themselves are significant, for previous results only cover diagonal scaling (the diagonal scal- 
ing result is trivial). Then, motivated by the discrepancy between constrained and unconstrained 
performance, we define a class of limited-memory quasi-Newton methods to solve ([F]) and that ex- 
tends naturally and elegantly from the unconstrained to the constrained case. Most well-known quasi- 
Newton methods for constrained problems, such as L-BFGS-B [2], are only applicable to box constraints 
I < x < u. The power of our approach is that it applies to a wide-variety of useful non-smooth func- 



tionals (see ^3.1.4 for a list) and that it does not rely on an active-set strategy. The approach uses the 
zero-memory SRI algorithm, and we provide evidence that the non-diagonal term provides significant 
improvements over diagonal Hessians. 

2 Quasi-Newton forward-backward splitting 
2.1 The algorithm 

In the following, define the quadratic approximation 

Qf{x) = f(x k ) + (Vf{x k ),x -x k ) + ^\\x- x k \\%, (4) 
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where B G S++(JV). 

The standard (non relaxed) version of the forward-backward splitting algorithm (also known as 
proximal or projected gradient descent) to solve ([P]) updates to a new iterate x k +i according to 

x k +i = argmin Qf k (x) + h(x) = prox tkh (x k - t k Vf(x k )) (5) 

X 

with Bk = t^Ifi, tk G]0,2/L[ (typically tk = l/L unless a line search is used). 

Note that this specializes to the gradient descent when h = 0. Therefore, if / is a strictly convex 
quadratic function and one takes Bk — W 2 f(xk), then we obtain the Newton method. Let's get back 
to h 7^ 0. It is now well known that fixed B = is usually a poor choice. Since / is smooth and 
can be approximated by a quadratic, and inspired by quasi-Newton methods, this suggest picking Bk 
as an approximation of the Hessian. Here we propose a diagonal+rank 1 approximation. 

Our diagonal+rank 1 quasi-Newton forward-backward splitting algorithm is listed in Algorithm [l] 
(with details for the quasi-Newton update in Algorithm]^] see f|4]for details) . These algorithms are listed 
as simply as possible to emphasize their important components; the actual software used for numerical 
tests is open-source and available at http://www.greyc.ensicaen.fr/~jfadili/software.html 

Algorithm 1: Zero-memory Symmetric Rank 1 (0SR1) algorithm to solve min / + h 
Require: xq G dom(/ + h), Lipschitz constant estimate L of V/, stopping criterion e 
1: for k = 1,2,3, ... do 

2: S k <- X k - X k -l 

3: Vk «- V/(x fc ) - V/(x fc _i) 

4: Compute Hk via Algorithm^ and define Bk = H^ 1 . 

5: Compute the rank-1 proximity operator (see fj3| 

Xk+i <- proxf fc (x k - H k Vf(x k )) (6) 
6: pk Xk+i — Xk and terminate if \\pk\\ < e 

7: Line-search along the ray Xk + tp k to determine Xk+i, or choose t = 1. 
8: end for 



2.2 Relation to prior work 

First-order methods The algorithm in ^ is variously known as proximal descent or iterated 
shrinkage/thresholding algorithm (1ST or ISTA). It has a grounded convergence theory, and also admits 
over-relaxation factors a G (0, 1) [3]- 

The spectral projected gradient (SPG) [4 method was designed as an extension of the Barzilai- 
Borwein spectral step-length method to constrained problems. In j5], it was extended to non-smooth 
problems by allowing general proximity operators; we refer to this as SPG/SpaRSA (N.B. we do not 
use the SpaRSA implementation since we do not use warm-starts or restarts, in order to be fair to 
all algorithms). The Barzilai-Borwein method [5] use a specific choice of step-length t k motivated by 
quasi-Newton methods. Numerical evidence suggests the SPG/SpaRSA method is highly effective, 
although convergence results are not as strong as for ISTA. 

FISTA [7] is a multi-step accelerated version of ISTA inspired by the work of Nesterov. The 
stepsize t is chosen in a similar way to ISTA; in our implementation, we tweak the original approach 
by using a Barzilai-Borwein step size, a standard line search, and restart jS], since this led to improved 
performance. Nesterov acceleration can be viewed as an over-relaxed version of ISTA with a specific, 
non-constant over-relaxation parameter a k - 

The above approaches assume B k is a constant diagonal. The general diagonal case was considered 
in several papers in the 1980s as a simple quasi-Newton method, but never widely adapted. More recent 
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attempts include a static choice B^ = B for a primal-dual method |||. A convergence rate analysis of 
forward-backward splitting with static and variable B^ where one of the operators is maximal strongly 
monotone is given in |10j . 

Active set approaches Active set methods take a simple step, such as gradient projection, to 
identify active variables, and then uses a more advanced quadratic model to solve for the free variables. 
A well-known such method is L-BFGS-B [5J [TT] which handles general box-constrained problems; we 
test an updated version [T2]. A recent bound-constrained solver is ASA [13] which uses a conjugate 
gradient (CG) solver on the free variables, and shows good results compared to L-BFGS-B, SPG, 
GENCAN and TRON. We also compare to several active set approaches specialized for t\ penalties: 
"Orthant-wise Learning" (OWL) [it], "Projected Scaled Sub-gradient + Active Set" (PSSas) [T5] . 
"Fixed-point continuation + Active Set" (FPC_AS) [TBJ, and "CG + 1ST" (CGIST) [17]. 

Other approaches By transforming the problem into a standard conic programming problem, the 
generic problem is amenable to interior-point methods (IPM). IPM requires solving a Newton-step 
equation, so first-order like "Hessian-free" variants of IPM solve the Newton-step approximately, either 
by approximately solving the equation or by subsampling the Hessian. The main issues are speed and 
robust stopping criteria for the approximations. 

Yet another approach is to include the non-smooth h term in the quadratic approximation. Yu et 
al. [TS] propose a non-smooth modification of BFGS and L-BFGS, and test on problems where h is 
typically a hinge-loss or related function. 

The projected quasi-Newton (PQN) algorithm [19l [20] is perhaps the most elegant and logical 
extension of quasi-Newton methods, but it involves solving a sub-iteration. PQN proposes the SPG [I] 
algorithm for the subproblems, and finds that this is an efficient tradeoff whenever the cost function 
(which is not involved in the sub-iteration) is relatively much more expensive to evaluate than projecting 
onto the constraints. Again, the cost of the sub-problem solver (and a suitable stopping criteria for 
this inner solve) are issues. As discussed in [21], it is possible to generalize PQN to general non-smooth 
problems whenever the proximity operator is known (since, as mentioned above, it is possible to extend 
SPG to this case). 

3 Proximity operators and proximal calculus 

We only recall essential definitions. More notions and results from convex analysis can be found in j |A") 

Definition 4 (Proximity operator [22]). Let h € Tq{T-C). Then, for every x E H, the function z M- 
\ \\x — z\ + h(z) achieves its infimum at a unique point denoted by prox^x. The uniquely-valued 
operator prox h : % — > % thus defined is the proximity operator or proximal mapping of h. 

3.1 Proximal calculus in %y 

Throughout, we denote proxj^ = (I-u v + V~ 1 dh)~ 1 , where dh is the subdifferential of h, the proximity 
operator of h w.r.t. the norm endowing Hy for some V E S++(N). Note that since V E S++(iV), the 
proximity operator prox^ is well-defined. 

Lemma 5 (Moreau identity in Hv)- Let h g Po('H) ) then for any x E W 

prox^, (x) + pV~ x o prox^ 1 oV(x/p) =i,V0<p< +oo . (7) 
The proof is in j pTT) 
Corollary 6. 

pmx^(x) = x — V^ 1 o proxjf, oV(x) . (8) 
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3.1.1 Diagonal+rank-1: General case 

Theorem 7 (Proximity operator in Hy)- Let h € Tq(H) and V = D + uu T , where D is diagonal with 
(strictly) positive diagonal elements di, and u £ WL N . Then. 

prox)f(x) = D- 1 ' 2 o prox^-v.p 1 /^ - v) , (9) 

where v = aD~^' 2 u and a is the unique root of 

p{a) = (u,x- D~ 1/2 o prox^-1/2 oD 1/2 {x - aD^u)) + a , (10) 



which is a Lipschitz continuous and strictly increasing function on R with Lipschitz constant 1 + 



The proof is in SB. 2 



Remark 8. 

• Computing proxj^ amounts to solving a scalar optimization problem that involves the computation 
o/prox^ o£) -i/2 . The latter can be much simpler to compute as D is diagonal (beyond the obvious 
separable case that we will consider shortly). This is typically the case when h is the indicator 
of the £i-ball or the canonical simple. The corresponding projector can be obtained in expected 
complexity 0(N log N) by simple sorting the absolute values 

• It is of course straightforward to compute prox^* from prox^ either using Theorem^ or using 
this theorem together with Corollary^ and the Sherman-Morrison inversion lemma. 

3.1.2 Diagonal+rank-1: Separable case 

The following corollary is key to our novel optimization algorithm. 

Corollary 9. Assume that h € To(H) is separable, i.e. h(x) — ^Zj=i hi{ x i)i an d V = D + uu T , where 
D is diagonal with (strictly) positive diagonal elements di, and u G R N . Then, 

prox^(ir) = (j>Tox hi/di (xi - «iM)). > ( n ) 

where v = au and a is the unique root of 

p(a) = (u,x- (proyL h . /dz (xi - mt;/dj)J ^ + a , (12) 

which is a Lipschitz continuous and strictly increasing function on K. 



Proof. As h is separable and D G S ++ (N) is diagonal, applying Theorem[7]together with Lemma 2^ii) 



(iii) the desired result follows. □ 



Proposition 10. Assume that for 1 i ^ N , prox^. is piecewise affine on K with ki > 1 segments, 
i.e. 

prox h .(Xi) = ajXi + b j} tj ^ Xi ^ t j+i ,j £ {l,...,fcj . 
Let k = JZj—i fej- Then prox^(ir) can be obtained exactly by sorting at most the k real values 

f£(*i-*i)i 

V * / (i,j)e{i,...,JV}x{i,...,fei} 
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Proof. Recall that (10) has a unique solution. When prox^. is piecewise affine with fe, segments, it 
is easy to see that p(a) in ( |l2"| ) is also piecewise afhne with slopes and intercepts changing at the k 

transition points ( — (x; —tA] . To get a*, it is sufficient to isolate the unique 

segment that intersects the abscissa axis. This can be achieved by sorting the values of the transition 
points which can cost in average complexity O(fclogfc). □ 

Remark 11. 

The above computational cost can be reduced in many situations by exploiting e.g. symmetry of 
the h^s, identical functions, etc. This turns out to be the case for many functions of interest, 
e.g. t\-norm, indicator of the (.co-ball or the positive orthant, and many others; see examples 
hereafter. 

Corollary^ can be extended to the "block" separable (i.e. separable in subsets of coordinates) 
when D is piecewise constant along the same block indices. 



3.1.3 Semi-smooth Newton method 

In many situations (see examples below), the root of p(a) can be found exactly in polynomial com- 



plexity. If no closed-form is available, one can appeal to some efficient iterative method to solve ( 10 1 



(or (12)). As p is Lipschitz-continuous, hence so-called Newton (slantly) differentiable, semi-smooth 
Newton are good such solvers, with the proviso that one can design a simple slanting function which 
can be algorithmically exploited. 



The semi-smooth Newton method for the solution of ( 10 ) can be stated as the iteration 

a t +\ = a t - g(at)' 1 p(a t ) , (13) 
where g is a generalized derivative of p. 

Proposition 12 (Generalized derivative of p). Ifprox hoD -i/2 is Newton differentiable with generalized 
derivative G, then so is the mapping p with a generalized derivative 

g{a) = 1 + (u, D~ 1/2 o G(D 1/2 x - aD- 1/2 u) o D~ 1 / 2 i 



Furthermore, g is nonsingular with a uniformly bounded inverse on M. 

Proof. This follows from linearity and the chain rule [23l Lemma 3.5]. The second statement follows 
strict increasing monotonicity of p as established in Theorem [7] □ 

Thus, as p is Newton differentiable with nonsingular generalized derivative whose inverse is also 



bounded, the general semi-smooth Newton convergence theorem implies that ( 13 ) converges super 



linearly to the unique root of ( 10 ) 



3.1.4 Examples 

Many functions can be handled very efficiently using our results above. For instance, Table [l] summa- 
rizes a few of them where we can obtain either an exact answer by sorting when possible, or else by 



minimizing w.r.t. to a scalar variable {i.e. finding the unique root of (10)). 

To put Propositior |10| on a more concrete footing, we briefly cover the positivity constraint explicitly. 
Let V = D + uu T and h(x) = ii x . x ^o}- We will calculate 

proxjf 1 (a;) = argmiai||y-a:||^-i (14) 
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Function h 



Algorithm 



£i-norm 

Hinge 

^oo-ball 

Box constraint 

Positivity constraint 

£i-ball 

^oo-norm 

Canonical simplex 
max function 



Separable: exact in 0(N log N) 
Separable: exact in 0(N log N) 

Separable: exact in 0(N log N) from ^-norm by Moreau-identity 
Separable: exact in 0(N log N) 
Separable: exact in 0(N log N) 

Nonseparable: semismooth Newton and prox hoD -i/ 2 costs 0(N log N) 
Nonseparable: from projector on the ^i-ball by Moreau-identity 
Nonseparable: semismooth Newton and prox^^-i/2 costs 0(N log N) 
Nonseparable: from projector on the simplex by Moreau-identity 



Table 1: Summary of functions which have efficiently computable rank-1 proximity operators 



Since we work with V 1 and not V, we will not use p(a) but rather p(a) which will be used in a similar 
way to p. 



If (y, A) is a primal-dual solution to (14), the KKT conditions must be satisfied: 
y > 0, A ^ 0, y T \ = 0, y = x + (D + uu T )\ 



(15) 



Define the scalar a — u T X. The key observation is that if a is known, then the problem is solved since 
it becomes separable and the solution is 



Vi = (xi + aui), , A 4 = (-(xi + aui)/di) + , i = 1, 



where (xi) + := max(0,a;i). Let A 

:.T\(a 



(a) 



{—{xi + aui)/di),, so we search for a value of a such that 



a A^', or in other words, a root of p(a) — a — u T \( a \ 



Define on to be the sorted values of (—Xi/ui), so we see that p is linear in the regions [on, di+i] and 
so it is trivial to check if p has a root in this region. Thus the problem is reduced to finding the correct 
region i, which can be done efficiently by a binary search over log 2 (n) values of i since p is monotonic. 
To see that p is monotonic, we write it as 



p(a) = a + ^2 {(uiXi + auj)/di) Xi(a) 



where Xi{°) encodes the positivity constraint in the argument of (•) , and is thus either or 1, hence 
the slope is always positive. 



4 A primal rank 1 SRI algorithm 

Following the conventional quasi-Newton notation, we let B denote an approximation to the Hessian 
of / and H denote an approximation to the inverse Hessian. All quasi-Newton methods update an 
approximation to the (inverse) Hessian that satisfies the secant condition: 

H k y k = s kl y k = Vf(x k ) -S7f(xk-i), s k =x k -x k -i (16) 

Algorithm [I] follows the SRI method [23], which uses a rank-1 update to the inverse Hessian 
approximation at every step. The SRI method is perhaps less well-known than BFGS, but it has the 
crucial property that updates are rank-1, rather than rank-2, and it is described "[SRI] has now taken 
its place alongside the BFGS method as the pre-eminent updating formula." [25 . 
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We propose two important modifications to SRI. The first is to use limited- memory, as is commonly 
done with BFGS. In particular, we use zero-memory, which means that at every iteration, a new 
diagonal plus rank-one matrix is formed. The other modification is to extend the SRI method to the 
general setting of minimizing f+h where / is smooth but h need not be smooth; this further generalizes 
the case when h is an indicator function of a convex set. Every step of the algorithm replaces / with 
a quadratic approximation, and keeps h unchanged. Because h is left unchanged, the subgradient of h 
is used in an implicit manner, in comparison to methods such as |18j that use an approximation to h 
as well and therefore take an explicit subgradient step. 



Algorithm 2: Sub-routine to compute the approximate inverse Hessian Hk 
Require: k, Sk, Vk, < 7 < 1, < r min < 
1: if k = 1 then 

2: H rlfi where r > is arbitrary 
3: "Life <- 
4: else 

5: t bb2 <— ^ fc '^ {Barzilai-Borwein step length} 

6: Project r B B2 onto [r min ,r max ] 

7: H <- 7TBB2l« 

8: if (sk-H y kl y k }<10- s \\y k \\ 2 \\s k ~Hoy k \\2 then 

9: u k <— {Skip the quasi-Newton update} 

10: else 

11: u k <- (s fe - H y k )/ v /(s k - H Q y k ,y k )). 
12: end if 
13: end if 

14: return H k = Hq + u k u\" {B k = H^ 1 can be computed via the Sherman-Morrison formula} 



Choosing Hq In our experience, the choice of Hq is best if scaled with a Barzilai-Borwein spectral 
step length 

T"BB2 = (s k ,yk) I (Vk, Vk) (17) 
(we call it tbb2 to distinguish it from the other Barzilai-Borwein step size tbbi = (s^Sfc) / {s kl y k ) 

TBB2)- 

In SRI methods, the quantity (s k — Hoy k ,y k ) must be positive in order to have a well-defined 
update for u k . The update is: 

H k = H + UkUk, u k = (s k - H y k )/ \J (s k - H y k ,y k ). (18) 

For this reason, we choose H = 7TBB2IH with < 7 < 1, and thus < (a* — H y k ,y k ) = (1 — 
7) (sk,Hk)- If (sk,Vk) = 0, then there is no symmetric rank-one update that satisfies the secant 
condition. The inequality (sk,yk) > is the curvature condition, and it is guaranteed for all strictly 
convex objectives. Following the recommendation in [26] . we skip updates whenever (sk, yk) cannot be 
guaranteed to be non-zero given standard floating-point precision. 

A value of 7 = 0.8 works well in most situations. We have tested picking 7 adaptively, as well as 
trying Hq to be non-constant on the diagonal, but found no consistent improvements. 

5 Numerical experiments and comparisons 

Consider the unconstrained LASSO problem ([T]). Many codes, such as [37] and L-BFGS-B 0, handle 
only non-negativity or box-constraints. Using the standard change of variables by introducing the 
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positive and negative parts of x, the LASSO can be recast as 

min \\\Ax + - Ax- - b\\ 2 + Xl T (x + + xJ) 

and then x is recovered via x — x + — X-. With such a formulation solvers such as L-BFGS-B are 
applicable. However, this constrained problem has twice the number of variables, and the Hessian of 

/ A T A -A T A\ 

the quadratic part changes from A T A to A = I ^ T . ^ T . J which necessarily has (at least) n 

degenerate eigenvalues and adversely affects solvers. 

A similar situation occurs with the hinge-loss function. Consider the shifted and reversed hinge 
loss function h(x) = max(0, x). Then one can split x = x + — X-, add constraints x + ^ 0,x_ 0, and 
replace h(x) with t T (x+). As before, the Hessian gains n degenerate eigenvalues. 



We compared our proposed algorithm on the LASSO problem. The first example, in Fig. 1 i is 
a typical example from compressed sensing that takes A £ l mXn to have iid AF(0,1) entries with 
m = 1500 and n = 3000. We set A = 0.1. L-BFGS-B does very well, followed closely by our proposed 
SRI algorithm and PSSas. Note that L-BFGS-B and ASA are in Fortran and C, respectively (the 
other algorithms are in Matlab). 

Our second example uses a square operator A with dimensions n — 13 3 = 2197 chosen as a 
3D discrete differential operator. This example stems from a numerical analysis problem to solve a 
discretized PDE as suggested by [35]. For this example, we set A = 1. For all the solvers, we use the 
same parameters as in the previous example. Unlike the previous example, Fig. |l|b| now shows that 
L-BFGS-B is very slow on this problem. The FPC-AS method, very slow on the earlier test, is now 
the fastest. However, just as before, our SRI method is nearly as good as the best algorithm. This 
robustness is one benefit of our approach, since the method does not rely on active-set identifying 
parameters and inner iteration tolerances. 



6 Conclusions 

In this paper, we proposed a novel variable metric (quasi-Newton) forward-backward splitting algo- 
rithm, designed to efficiently solve non-smooth convex problems structured as the sum of a smooth 
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term and a non-smooth one. We introduced a class of weighted norms induced by a diagonal+rank 
1 symmetric positive definite matrices, and proposed a whole framework to compute a proximity op- 
erator in the weighted norm. The latter result is distinctly new and is of independent interest. We 
also provided clear evidence that the non-diagonal term provides significant acceleration over diagonal 
matrices. 

The proposed method can be extended in several ways. Although we focused on forward-backward 
splitting, our approach can be easily extended to the new generalized forward-backward algorithm of 
|29j . However, if we switch to a primal-dual setting, which is desirable because it can handle more 
complicated objective functionals, updating is non-obvious. Though one can think of non-diagonal 
pre-conditioning methods. 

Another improvement would be to derive efficient calculation for rank-2 proximity terms, thus 
allowing a 0-memory BFGS method. We are able to extend (result not presented here) Theorem [7] to 
diagonal+rank r matrices. However, in general, one must solve an r-dimensional inner problem using 
the semismooth Newton method. 

A final possible extension is to take Bk to be diagonal plus rank-1 on diagonal blocks, since if h is 



separable, this is still can be solved by our algorithm (see Remark 10). The challenge here is adapting 
this to a robust quasi-Newton update. For some matrices that are well-approximated by low-rank 
blocks, such as H-matrices [30 , it may be possible to choose Bk = B to be a fixed preconditioner. 
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A Elements from convex analysis 

We here collect some results from convex analysis that are key for our proof. Some lemmata are listed 
without proof and can be either easily proved or found in standard references such as [311 ITj. 



A.l Background 
Functions 

Definition 13 (Indicator function). Let C a nonempty subset ofTL. The indicator function iq of C is 



«c(sc) 



I). if x G C , 
+00, otherwise. 



dom(ze) = C. 

Definition 14 (Infimal convolution). Let h\ and h 2 two functions from H toMU{+oo}. Their infimal 
convolution is the function from H to K U {±00} defined by: 

{hi V h 2 )(x) = inf {h\(x\) + h 2 (x 2 ) : x\ + x 2 — x} — inf hi(y) + h 2 (x - y) . 

Conjugacy 

Definition 15 (Conjugate). Let h :% — > RU{+oo} having a minorizing affine function. The conjugate 
or Legendre-Fenchel transform of h onli. is the function h* defined by 

h*(v) — sup (v,x) — h{x) . 

a;£dom(/i) 

Lemma 16 (Calculus rules). 

(i) (h{x)+t)*(v) = h*(v)-t. 

(ii) (th(x))*(v)= tf*(v/t), t>0. 

(hi) (h o A)* = h* o (A^ 1 ^ if A is a linear invertible operator. 

(iv) (h(x - x ))*(v) = h*(v) + (v,x ). 

(v) Separability: (Yn=i hi( x i))* ( v i> ,mm ."«) = Yh=i K( v i)> where (x\, ■ ■ ■ , x n ) € %i X • • • X % n . 

(vi) Conjugate of a sum: assume h\ 1 h 2 6 Tq{J€) and the relative interiors of their domains have a 
nonempty intersection. Then 

(hi + h 2 )* = h\ V h* 2 . 

(vii) Conjugate in Hy for V G S++(iV); h v (u) = h*(Vu). 

Lemma 17 (Conjugate of a degenerate quadratic function). Let Q be a symmetric positive semi- 
definite matrix. Let Q + be its Moore-Penrose pseudo-inverse. Then, 

1 \\y--\\lX(v) = { 1 * lly ~ vfQ+ tf^w + MQ), ( i9) 

2 y / I +00 otherwise . 

Lemma 18 (Conjugate of a rank-1 quadratic function). Let u e TL. Then, 

hu,.fX („,= {S (2 „) 
z / I +00 otherwise. 
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Subdifferential 

Definition 19 (Subdifferential). The subdifferential of a proper convex function h G r ("H) at x G W 
is the set-valued map dh : H — > 2 n 

dh{x) = {ve H\Vz G H, h{z) > h(x) +{v,z- x)} . 

An element v of dh is called a subgradient. 

The subdifferential map dh is a maximal monotone operator from T-L — > 2^. 

Lemma 20. If h is (Gateaux) differentiable at x, its only subgradient at x is its gradient \7h(x). 

Lemma 21. Let V E §++(7V). Then Vdh is the subdifferential of h in Hv ■ 

Fenchel-Rockafellar duality The duality formula to be stated shortly will be very useful through- 
out the rest of the paper. 

Lemma 22. Let h G r ("H) and g G T (IC), and A = L ■ —y : T-L — > JC be a bounded affine operator, 
and K, — (M m , (■, ■)). Suppose that G ri (domg — A (domft,)). Then 

inf h(x) + g o A(x) = — min/i*(— L*u) + g*(u) + (u, y) , (21) 

with the relashionships between x* and u* , respectively the solutions of the primal and dual problems 

h(x*) + h*(-L*u k ) = (-L*u*,x*) , (22) 

g(Ax*)+g*(u*) = (u*,Ax*), (23) 

or equivalently 

x* £dh*(-L*u*) and u* G dg{Ax*) , (24) 

-L*u*€dh(x*) and Ax* G dg*{u) . (25) 

A. 2 Proximal calculus in % 

Definition 23 (Moreau envelope [12]). The function p h(x) — inf 2S ^ ^ ||a: — z\\ 2 + h(z) for < p < 
+oo is the Moreau envelope of index p of h. 

p h is also the infimal convolution of h with i ||-|| . 

Lemma 24. 

(i) Translation: prox h (._ y - ) (x) = y + prox h (x — y). 

(ii) Scaling: Vp G (— co, oo), prox^.^rr) = pTox p2 f(px)/ p. 

(hi) Separability : let (/ii)i<i<„ a family of functions each in To(K) and h(x) = 2i=i hi{ x i)- Then h 
is in T (H) and prox h = (prox^)^^. 

Lemma 25. Let h G ro("H). Then its Moreau envelope p h is convex and Frechet- differentiable with 
\ I p-Lipschitz gradient 

V?h = (l n - P rox ph )/p. (26) 
Lemma 26 (Moreau identity). Let h G T a (T-L), then for any x G H 

pmx ph ,(x) + pprox h/p {x/p) = x,V < p < +oo . (27) 
From Lemma |26| we conclude that 

prox,^ = l H - prox h , pmx h , (x) G dh(x) . 
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B Proofs 

B.l Proof of Lemma [5] 

Proof. We have 

p = proxJ^Or) = {l Hv + V- 1 ph*)- 1 {x) & 



v(x-p)ed(ph*)( P ) 

p€dh(V(x-p)/p) 

Vx/p-{Vx-Vp)/pe Vd(h/p){V(x-p)/p) 
V(x - p)/ P = (I nv + Vd{h/p))-\Vx) 
x=p + pV- 1 o(I nv +Vd(h/ P ))- 1 (Vx) . 



B.2 Proof of Theorem [7] 

Proof. Let p — prox^(x). Then, we have to solve 



1 2 
min - — z\\ v + h(z) 
z 2 

mmf- \\zf D 



y = D 1/2 z,q = D 1/2 u mini 

y 



(Lemma [22pT| |) min 



(x, z) D + h{z) j + (x — z, uu T (x — z)} 
in Q || y || 2 - (i? 1 ^ y ^ + ft + (dV* x _ y , gg r (i? v 2a . _ y) 

^!/ 2 x, j+fto C- 1 / 2 ^) * (-«) + ((i? 1 / 2 x - qq T (D x l 2 x - •) 

/ 2 IMI / 

(Le mma |lfvi)|(iii)t ^(Ql|-H 2 -(^ 1/2 ».-)) V(^oDV2)j H+ H + ^1/ 2;C( ^ 



(Lemma 18 and Lemma If iv) I 4=> min . 

1 — 1 " — ' veRg V 2 



<S=> mm , 
ueKg V V 2 



D 1 l 2 x 



(Definition EH mm ^/i* o D 1 / 2 ) (D 1 / 2 ^ - w) + ^ 2 + (d 1/2 x, 



By virtue of Lemma 



with Lemma 22 24) 



25 



2NI 

1 (/i* o D 1 / 2 ) is continuously differentiable with 1-Lipschitz gradient. Together 



30 and 26 this yields 



p = D- 1 ' 2 oV (h*oD 1 / 2 )(D^ 2 x-v*) = D- 1 ^ 2 o(l n ~pTox h , oDl/ 2)(D 1 / 2 x-v*) 

= D~ 1/2 o pvox hoD - 1/2 oD 1/2 (x - D~ 1/2 v*), 



where v* is the unique solution to the above dual problem ( 28 1 . This problem amounts to minimizing a 



proper convex smooth continuously differentiable objective with a Lipschitz gradient over a linear set. 
The latter can be parametrized by a real scalar a such that v — aq = aD^ 1 / 2 u 1 and is then equivalent 
to solving the scalar strongly convex smooth optimization problem 



min l (h* o D 1/2 \ (D 1/2 x - aD- 1/2 u) + — +a(x,u) , 
aem. V / 2 



(29) 



(r) 



(28) 
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whose solution a* is unique. This is equivalent to saying that a* is the unique root of 
p(a) = (u, x — D^ 1 / 2 o prox /loD -i/2 oD 1 ^ 2 (x — aD^u)) + a , 



(30) 



where we used again Lemma 25 and 26 Lipschitz continuity of p(ot) follows from non-expansiveness 
of the proximal mapping, and the Lipschitz constant is straightforward from the triangle and Cauchy- 
Schwartz inequalities. 

Let's turn now to strict increasing monotonicity of p. Let f3 > a. Denote the operator P = 
proXhoD-i/s- Then, 



p{(3)-p{a) = {P-a)- {p- 1/2 u,PoD 1/2 {x- f3D- 1 u)-PoD ll2 {x~aD- 1 u)^ 



(f3 - a) + {f3 - a)~ L {-(P - a)D 



-1/2, 



P(D 1/2 x - PD~ 1/2 u) - P(D L/2 x - aD 



-1/2 



1/2, 



"1/2, 



> (0 - a) + (/? - a)- 1 P(D 1 / 2 x- pD- 1 / 2 u)-P(D 1 / 2 x~aD- 1 ' 2 u) 

> 0, 

where the first inequality is a consequence of the fact that the proximal mapping is firmly non-expansive. 

□ 
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